Abstract

An efficient incremental learning algorithm for classification tasks, called
NetLines, well adapted for both binary and real-valued input patterns 
is presented. It generates small compact feedforward neural networks 
with one hidden layer of binary units and binary output units. A 
convergence theorem ensures that solutions with a finite number of 
hidden units exist for both binary and real-valued input patterns. An
implementation for problems with more than two classes, valid for any 
binary classifier, is proposed. The generalization error and the size of 
the resulting networks are compared to the best published results on 
well-known classification benchmarks. Early stopping is shown to de- 
crease overfitting, without improving the generalization performance. 



1 Introduction 

Feedforward neural networks have been successfully applied to the problem
of learning pattern classification from examples. The relationship between
number of weights, learning capacity and network's generalization ability is 
well understood only for the simple perceptron, a single binary unit whose 
output is a sigmoidal function of the weighted sum of its inputs. In this 
case, efficient learning algorithms based on theoretical results allow the
determination of the optimal weights. However, simple perceptrons can only
generalize those (very few) problems in which the input patterns are linearly 
separable (LS). In many actual classification tasks, multilayered perceptrons 
with hidden units are needed. However, neither the architecture (number of 
units, number of layers) nor the functions that hidden units have to learn are 
known a priori, and the theoretical understanding of these networks is not 
enough to provide useful hints. 

Although pattern classification is an intrinsically discrete task, it may be 
cast as a problem of function approximation or regression, by assigning
real values to the targets. This is the approach used by Backpropagation 
and related algorithms, which minimize the squared training error of the 
output units. The approximating function must be highly non-linear, as it 
has to fit a constant value inside the domains of each class, and present a 
large variation at the boundaries between classes. For example, in a binary 
classification task in which the two classes are coded as +1 and —1, the 
approximating function must be constant and positive in the input space 
regions or domains corresponding to class 1, and constant and negative for 
those of class —1. The network's weights are trained to fit this function 
everywhere, in particular inside the class-domains, instead of concentrating 
on the relevant problem of the determination of the frontiers between classes. 
As the number of parameters needed for the fit is not known a priori, it is
tempting to train a large number of weights, which can span, at least in
principle, a large set of functions that is expected to contain the "true" one.
This introduces a small bias [17], but leaves us with the difficult problem of
minimizing a cost function in a high dimensional space, with the risk that the 
algorithm gets stuck in spurious local minima, whose number grows with the 
number of weights. In practice, the best generalizer is determined through a 
trial and error process in which both the number of neurons and weights are 
varied. 

An alternative approach is provided by incremental, adaptive or growth 
algorithms, in which the hidden units are successively added to the network. 
One advantage is fast learning, not only because the problem is reduced to 
training simple perceptrons, but also because adaptive procedures do not 
need the trial and error search for the most convenient architecture. Growth 
algorithms allow the use of binary hidden neurons, well suited for building 
hardware dedicated devices. Each binary unit determines a domain bound- 
ary in input space. Patterns lying on either side of the boundary are given 
different hidden states. Thus, all the patterns inside a domain in input space
are mapped to the same internal representation (IR). This binary encod- 
ing is different for each domain. The output unit performs a logic (binary) 
function of these IRs, a feature that may be useful for rule extraction. As 
there is not a unique way of associating IRs to the input patterns, different 
incremental learning algorithms propose different targets to be learnt by the 
appended hidden neurons. This is not the only difference: several heuristics 
exist that generate fully connected feedforward networks with one or more 
layers, and tree-like architectures with different types of neurons (linear, ra- 
dial basis functions). Most of these algorithms are not optimal with respect 
to the number of weights or hidden units. Indeed, growth algorithms have 
often been criticized because they may generate too large networks, generally 
believed to be bad generalizers because of overfitting. 

The aim of this paper is to present a new incremental learning algorithm 
for binary classification tasks, that generates small feedforward networks. 
These networks have a single hidden layer of binary neurons fully connected 
to the inputs, and a single output neuron connected to the hidden units. We 
propose to call it NetLines, for Neural Encoder Through Linear Separations. 
During the learning process, the targets that each appended hidden unit has 
to learn help to decrease the number of classification errors of the output 
neuron. The crucial test for any learning algorithm is the generalization 
ability of the resulting network. It turns out that the networks built with 
NetLines are generally smaller, and generalize better, than the best networks 
found so far on well-known benchmarks. Thus, large networks do not nec- 
essarily follow from growth heuristics. On the other hand, although smaller 
networks may be generated with NetLines through early stopping, we found 
that they do not generalize better than the networks that were trained until 
the number of training errors vanished. Thus, overfitting does not necessarily 
spoil the network's performance. This surprising result is in good agreement 
with recent work on the bias/variance dilemma [13] showing that, unlike in 
regression problems where bias and variance compete in the determination of 
the optimal generalizer, in the case of classification they combine in a highly 
nonlinear way.

Although NetLines creates networks for two-class problems, multi-class 
problems may be solved using any strategy that combines binary classifiers, 
like winner-takes-all. In the present work we propose a more involved ap- 
proach, through the construction of a tree of networks, that may be coupled 
with any binary classifier. 

NetLines is an efficient approach to create small compact classifiers for
problems with binary or continuous inputs. It is most suited for problems 
where a discrete classification decision is required. Although it may estimate
posterior probabilities, as discussed in section 2.6, it requires more
information than the bare network's output. Another weakness of NetLines is that
it is not simple to retrain the network when, for example, new patterns are 
available or class priors change over time. 

The paper is organized as follows: in section 2 we give the basic definitions
and present a simple example of our strategy. This is followed by the formal
presentation of the growth heuristics and the perceptron learning algorithm
used to train the individual units. In section 3 we compare NetLines to
other growth strategies. The construction of trees of networks for multi-class
problems is presented in section 4. A comparison of the generalization error
and the network's size, with respect to results obtained with other learning
procedures, is presented in section 5. The conclusions are left to section 6.

2 The Incremental Learning Strategy 
2.1 Definitions 

We first present our notation and basic definitions. We are given a training set
of $P$ input-output examples $\{\xi^\mu, \tau^\mu\}$, where $\mu = 1, 2, \dots, P$. The inputs
$\xi^\mu = (1, \xi_1^\mu, \xi_2^\mu, \dots, \xi_N^\mu)$ may be binary or real-valued $N+1$ dimensional vectors.
The first component $\xi_0^\mu \equiv 1$, the same for all the patterns, allows the
bias to be treated as a supplementary weight. The outputs are binary, $\tau^\mu = \pm 1$. These
patterns are used to learn the classification task with the growth algorithm.
Assume that, at a given stage of the learning process, the network has already
$h$ binary neurons in the hidden layer. These neurons are connected to the
$N+1$ input units through synaptic weights $w_k = (w_{k0}, w_{k1}, \dots, w_{kN})$, $1 \le k \le h$,
$w_{k0}$ being the bias.

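Then, given an input pattern $\xi$, the states $\sigma_k$ of the hidden neurons
($1 \le k \le h$) given by

$$\sigma_k = \mathrm{sign}\left(\sum_{i=0}^{N} w_{ki}\,\xi_i\right) \equiv \mathrm{sign}(w_k \cdot \xi) \tag{1}$$

define the pattern's $h$-dimensional IR, $\sigma(h) = (1, \sigma_1, \dots, \sigma_h)$. The network's
output is:

$$\zeta(h) = \mathrm{sign}\left(\sum_{k=0}^{h} W_k\,\sigma_k\right) \equiv \mathrm{sign}\left(W(h) \cdot \sigma(h)\right) \tag{2}$$

where $W(h) = (W_0, W_1, \dots, W_h)$ are the weights of the output unit.
Hereafter, $\sigma^\mu(h) = (1, \sigma_1^\mu, \dots, \sigma_h^\mu)$ is the $h$-dimensional IR associated by the
network of $h$ hidden units to pattern $\xi^\mu$. During the training process, $h$
increases through the addition of hidden neurons, and we denote $H$ the final
number of hidden units.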
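
As an illustration of equations (1) and (2), the forward pass of such a network may be sketched as follows. This is a minimal sketch, not part of the original algorithm description: the function names are hypothetical, the tie rule sign(0) = +1 is an assumption, and the example weights are hand-picked to implement XOR on inputs coded as ±1.

```python
import numpy as np

def sign(x):
    """Binary activation: +1 for x >= 0, -1 otherwise (the tie rule is an assumption)."""
    return np.where(x >= 0, 1, -1)

def internal_representation(w, xi):
    """Equation (1). w has shape (h, N+1); xi has shape (N+1,) with xi[0] = 1.
    Returns sigma(h) = (1, sigma_1, ..., sigma_h)."""
    return np.concatenate(([1], sign(w @ xi)))

def network_output(W, w, xi):
    """Equation (2). W has shape (h+1,), W[0] being the bias of the output unit."""
    return int(sign(W @ internal_representation(w, xi)))

# Hand-picked weights (illustrative only): two hidden units and one output unit.
w_xor = np.array([[-1, 1, 1],      # fires only when both inputs are +1
                  [-1, -1, -1]])   # fires only when both inputs are -1
W_xor = np.array([-1, -1, -1])     # +1 only when neither hidden unit fires
```

With these weights, `network_output(W_xor, w_xor, np.array([1, 1, -1]))` evaluates the network on the input (+1, -1), the leading 1 being the constant component $\xi_0$.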



2.2 Example 

Let us first describe the general strategy on a schematic example (fig. 1).
Patterns in the grey region belong to class $\tau = +1$, the others to $\tau = -1$.
The algorithm proceeds as follows: a first hidden unit is trained to separate
the input patterns as well as possible, and finds one solution, say $w_1$, represented on
fig. 1 by the line labelled 1, with the arrow pointing into the positive half-space.
As there remain training errors, a second hidden neuron is introduced.
It is trained to learn targets $\tau_2 = +1$ for patterns well classified by the first
neuron, $\tau_2 = -1$ for the others (the opposite convention could be adopted,
both being strictly equivalent), and suppose that solution $w_2$ is found. Then
an output unit is connected to the two hidden neurons and is trained with
the original targets. Clearly it will fail to separate correctly all the patterns
because the IR $(-1, 1)$ is not faithful, as patterns of both classes are mapped
onto it. The output neuron is dropped, and a third hidden unit is appended
and trained with targets $\tau_3 = +1$ for patterns that were correctly classified by
the output neuron and $\tau_3 = -1$ for the others. Solution $w_3$ is found, and it is
easy to see that now the IRs are faithful, i.e. patterns belonging to different
classes are given different IRs. The algorithm converged with 3 hidden units
that define 3 domain boundaries determining 6 regions or domains in the
input space. It is straightforward to verify that the IRs corresponding to
each domain, indicated on fig. 1, are linearly separable. Thus, the output
unit will find the correct solution to the training problem. If the faithful
IRs were not linearly separable, the output unit would not find a solution
without training errors, and the algorithm would go on appending hidden
units that should learn targets $\tau = 1$ for well learnt patterns, and $\tau = -1$
for the others. A proof that a solution to this strategy with a finite number
of hidden units exists is left to the Appendix.

Figure 1: Example: patterns inside the grey region belong to one class, those
in the white region to the other. The lines (labelled 1, 2 and 3) represent
the hyperplanes found with the NetLines strategy. The arrows point into the
corresponding positive half-spaces. The IRs of each domain are indicated
(the first component, $\sigma_0 = 1$, is omitted for clarity).

2.3 The Algorithm NetLines 

Like most adaptive learning algorithms, NetLines combines a growth heuris- 
tics with a particular learning algorithm for training the individual units, 
which are simple perceptrons. In this section we present the growth heuris- 
tics first, followed by the description of Minimerror, our perceptron learning 
algorithm. 

Let us first introduce the following useful remark: if a neuron has to learn
a target $\tau$, and the learnt state turns out to be $\sigma$, then the product $\sigma\tau = 1$
if the target has been correctly learnt, and $\sigma\tau = -1$ otherwise.

Given a maximal accepted number of hidden units, $H_{max}$, and a maximal
number of tolerated training errors, $E_{max}$, the algorithm may be summarized
as follows:

Algorithm NetLines

• initialize
    $h = 0$;
    set the targets $\tau_{h+1}^\mu = \tau^\mu$ for $\mu = 1, \dots, P$;

• repeat

    1. /* train the hidden units */
       $h = h + 1$; /* connect hidden unit $h$ to the inputs */
       learn the training set $\{\xi^\mu, \tau_h^\mu\}$, $\mu = 1, \dots, P$;
       after learning, $\sigma_h^\mu = \mathrm{sign}(w_h \cdot \xi^\mu)$, $\mu = 1, \dots, P$;
       if $h = 1$ /* for the first hidden neuron */
           if $\sigma_1^\mu = \tau_1^\mu \;\forall \mu$ then stop; /* the training set is LS */
           else set $\tau_{h+1}^\mu = \sigma_h^\mu \tau_h^\mu$ for $\mu = 1, \dots, P$; go to 1;
       end if

    2. /* learn the mapping between the IRs and the outputs */
       connect the output neuron to the $h$ trained hidden units;
       learn the training set $\{\sigma^\mu(h), \tau^\mu\}$, $\mu = 1, \dots, P$;
       after learning, $\zeta^\mu(h) = \mathrm{sign}\left(W(h) \cdot \sigma^\mu(h)\right)$, $\mu = 1, \dots, P$;
       set $\tau_{h+1}^\mu = \zeta^\mu \tau^\mu$ for $\mu = 1, \dots, P$;
       count the number of training errors $e = \sum_\mu (1 - \tau_{h+1}^\mu)/2$;

• until ($h = H_{max}$ or $e \le E_{max}$);

The generated network has H = h hidden units. In the Appendix we present 
a solution to the learning strategy with a bounded number of hidden units. In 
practice the algorithm ends up with much smaller networks than this upper 
bound, as will be shown in Section 5.
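
The growth heuristics summarized above may be sketched in a few lines of Python. This is only an illustration under stated assumptions: the individual units are trained here with a simple pocket-style perceptron, used as a stand-in and not to be confused with Minimerror; sign(0) is taken as +1; all function names are hypothetical; and, when the training set is LS, the sketch returns an output unit that merely copies the single hidden unit.

```python
import numpy as np

def sign(x):
    return np.where(x >= 0, 1, -1)

def train_perceptron(X, t, epochs=200):
    """Stand-in trainer (pocket-style perceptron), NOT Minimerror: returns the
    weights with the fewest training errors seen during perceptron updates."""
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), np.sum(sign(X @ w) != t)
    for _ in range(epochs):
        for x, target in zip(X, t):
            if sign(w @ x) != target:
                w = w + target * x
                err = np.sum(sign(X @ w) != t)
                if err < best_err:
                    best_w, best_err = w.copy(), err
    return best_w

def hidden_states(hidden, X):
    """IRs (1, sigma_1, ..., sigma_h) of all the patterns, one row per pattern."""
    return np.column_stack([np.ones(len(X))] + [sign(X @ w) for w in hidden])

def netlines(X, tau, h_max=10, e_max=0):
    """X: (P, N+1) with X[:, 0] = 1; tau: (P,) targets in {-1, +1}."""
    hidden, target = [], tau.copy()
    while True:
        # Step 1: append hidden unit h and train it on the current targets.
        hidden.append(train_perceptron(X, target))
        sigma_h = sign(X @ hidden[-1])
        if len(hidden) == 1:
            if np.all(sigma_h == tau):                 # the training set is LS
                return np.array(hidden), np.array([0.0, 1.0])
            target = sigma_h * tau                     # tau_2 = sigma_1 * tau
            continue
        # Step 2: train the output unit on the IRs with the original targets.
        S = hidden_states(hidden, X)
        W = train_perceptron(S, tau)
        target = sign(S @ W) * tau                     # tau_{h+1} = zeta * tau
        errors = np.sum((1 - target) / 2)
        if len(hidden) == h_max or errors <= e_max:
            return np.array(hidden), W

def predict(hidden, W, X):
    return sign(hidden_states(hidden, X) @ W)
```

On the XOR problem with inputs coded as ±1, this sketch stops after appending two hidden units; with another unit trainer the number of units could differ.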

2.4 The perceptron learning algorithm 

The final number of hidden neurons, which are simple perceptrons, depends 
on the performance of the learning algorithm used to train them. The best 
solution should minimize the number of errors; if the training set is LS it 
should endow the units with the lowest generalization error. Our incremen- 
tal algorithm uses Minimerror [20] to train the hidden and output units. 
Minimerror is based on the minimization of a cost function E that depends 
on the perceptron weights w through the stabilities of the training patterns. 
If the input vector is $\xi^\mu$ and $\tau^\mu$ the corresponding target, then the stability
$\gamma^\mu$ of pattern $\mu$ is a continuous and differentiable function of the weights, given
by:

$$\gamma^\mu = \tau^\mu\,\frac{w \cdot \xi^\mu}{\|w\|} \tag{3}$$

where $\|w\| = \sqrt{w \cdot w}$. The stability is independent of the norm of the weights
$\|w\|$. It measures the distance of the pattern to the separating hyperplane,
which is normal to $w$; it is positive if the pattern is well classified, negative
otherwise. The cost function $E$ is:

$$E = \frac{1}{2} \sum_{\mu=1}^{P} \left[1 - \tanh\frac{\gamma^\mu}{2T}\right] \tag{4}$$

The contribution to $E$ of patterns with large negative stabilities is $\approx 1$,
i.e. they are counted as errors, whereas the contribution of patterns with
large positive stabilities is vanishingly small. Patterns at both sides of the
hyperplane within a window of width $\approx 4T$ contribute to the cost function
even if they have positive stability.

The properties of the global minimum of (4) have been studied theoretically
with methods of statistical mechanics [21]. It was shown that in the
limit $T \to 0$, the minimum of $E$ corresponds to the weights that minimize
the number of training errors. If the training set is LS, these weights are
not unique [24]. In that case, there is an optimal learning temperature such
that the weights minimizing $E$ at that temperature endow the perceptron
with a generalization error numerically indistinguishable from the optimal
(Bayesian) value.

The algorithm Minimerror [20, 33] implements a minimization of $E$ restricted
to a sub-space of normalized weights, through a gradient descent
combined with a slow decrease of the temperature $T$, which is equivalent to
a deterministic annealing. It has been shown that the convergence is faster if
patterns with negative stabilities are considered at a temperature $T_-$ larger
than those with positive stabilities, $T_+$, with a constant ratio $\theta = T_-/T_+$.
The weights and the temperatures are iteratively updated through:

$$\delta w(t) = \epsilon \left[ \sum_{\mu / \gamma^\mu \le 0} \frac{\tau^\mu \xi^\mu}{\cosh^2(\gamma^\mu / 2T_-)} + \sum_{\mu / \gamma^\mu > 0} \frac{\tau^\mu \xi^\mu}{\cosh^2(\gamma^\mu / 2T_+)} \right] \tag{5}$$

$$T_+^{-1}(t+1) = T_+^{-1}(t) + \delta T^{-1}; \quad T_- = \theta\, T_+ \tag{6}$$

$$w(t+1) = \sqrt{N+1}\; \frac{w(t) + \delta w(t)}{\| w(t) + \delta w(t) \|} \tag{7}$$

Notice from (5) that only the incorrectly learned patterns at distances shorter
than $\approx 2T_-$ from the hyperplane, and those correctly learned lying closer
than $\approx 2T_+$, contribute effectively to learning. The contribution of patterns
outside this region is vanishingly small. By decreasing the temperature, the
algorithm selects to learn patterns increasingly localized in the neighborhood
of the hyperplane, allowing for a highly precise determination of the parameters
defining the hyperplane, which are the neuron's weights. Normalization
(7) restricts the search to the sub-space with $\|w\| = \sqrt{N+1}$.

The only adjustable parameters of the algorithm are the temperature ratio
$\theta = T_-/T_+$, the learning rate $\epsilon$ and the annealing rate $\delta T^{-1}$. In principle
they should be adapted to each specific problem. However, thanks to our
normalizing the weights to $\sqrt{N+1}$ and to data standardization (see next
section), all the problems are brought to the same scale, simplifying the
choice of the parameters.
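
To make the roles of the stability, the cost function and the two temperatures concrete, one iteration of the updates (5)-(7) may be sketched as follows. This is a hedged sketch rather than the reference implementation of Minimerror: the function names are hypothetical, the inverse temperature $1/T_+$ is passed explicitly, and the default annealing rate `d_beta` as well as any initial temperature are assumed, illustrative values.

```python
import numpy as np

def stabilities(w, X, tau):
    """Equation (3): gamma^mu = tau^mu (w . xi^mu) / ||w||."""
    return tau * (X @ w) / np.linalg.norm(w)

def cost(w, X, tau, T):
    """Equation (4): E = 1/2 sum_mu [1 - tanh(gamma^mu / 2T)]."""
    return 0.5 * np.sum(1.0 - np.tanh(stabilities(w, X, tau) / (2.0 * T)))

def minimerror_step(w, X, tau, beta_plus, eps=0.02, theta=6.0, d_beta=1e-3):
    """One iteration of (5)-(7). beta_plus = 1/T_+; d_beta is an assumed value."""
    T_plus = 1.0 / beta_plus
    T_minus = theta * T_plus
    gamma = stabilities(w, X, tau)
    # Patterns with positive stability use T_+, the others use T_- = theta * T_+.
    T = np.where(gamma > 0, T_plus, T_minus)
    coeff = tau / np.cosh(gamma / (2.0 * T)) ** 2
    dw = eps * (coeff @ X)                              # equation (5)
    w_new = w + dw
    w_new *= np.sqrt(len(w)) / np.linalg.norm(w_new)    # equation (7)
    return w_new, beta_plus + d_beta                    # equation (6)
```

A training loop would call `minimerror_step` repeatedly, feeding back the returned weights and inverse temperature; a stopping rule is not shown and would have to be chosen separately.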



2.5 Data standardization 

Instead of determining the best parameters for each new problem, we stan- 
dardize the input patterns of the training set through a linear transformation, 
applied to each component: 

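$$\tilde{\xi}_i^\mu = \frac{\xi_i^\mu - \langle \xi_i \rangle}{\Delta_i}; \quad 1 \le i \le N \tag{8}$$

The mean $\langle \xi_i \rangle$ and the variance $\Delta_i^2$, defined as usual:

$$\langle \xi_i \rangle = \frac{1}{P} \sum_{\mu=1}^{P} \xi_i^\mu \tag{9}$$

$$\Delta_i^2 = \frac{1}{P} \sum_{\mu=1}^{P} \left(\xi_i^\mu - \langle \xi_i \rangle\right)^2 = \frac{1}{P} \sum_{\mu=1}^{P} \left(\xi_i^\mu\right)^2 - \langle \xi_i \rangle^2 \tag{10}$$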

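need only a single pass of the $P$ training patterns to be determined. After
learning, the inverse transformation is applied to the weights:

$$\tilde{w}_0 = \sqrt{N+1}\; \frac{w_0 - \sum_{i=1}^{N} w_i \langle \xi_i \rangle / \Delta_i}{\sqrt{\left[w_0 - \sum_{j=1}^{N} w_j \langle \xi_j \rangle / \Delta_j\right]^2 + \sum_{j=1}^{N} (w_j/\Delta_j)^2}} \tag{11}$$

$$\tilde{w}_i = \sqrt{N+1}\; \frac{w_i / \Delta_i}{\sqrt{\left[w_0 - \sum_{j=1}^{N} w_j \langle \xi_j \rangle / \Delta_j\right]^2 + \sum_{j=1}^{N} (w_j/\Delta_j)^2}}; \quad 1 \le i \le N \tag{12}$$

so that the normalization (8) is completely transparent to the user: with the
transformed weights (11) and (12) the neural classifier is applied to the data
in the original user's units, which do not need to be renormalized.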

As a consequence of the weights scaling (7) and the inputs standardization
(8), all the problems are automatically rescaled. This allows us to always use
the same values of Minimerror's parameters, namely, the "standard"
values $\epsilon = 0.02$, $\delta T^{-1}$ = 10~^ and $\theta = 6$. They were used throughout this
paper, the reported results being highly insensitive to slight variations of
them. However, in some extremely difficult cases, like learning the parity
in dimensions $N > 10$ and finding the separation of the sonar signals (see
section 5), larger values of $\theta$ were needed.
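
The standardization and the inverse transformation of the weights may be checked numerically with a short sketch. The function names are hypothetical, and the code assumes that no input component is constant over the training set (otherwise $\Delta_i = 0$).

```python
import numpy as np

def standardize(X):
    """Equations (8)-(10). X has shape (P, N): inputs without the constant component."""
    mean = X.mean(axis=0)
    # Single-pass form of the variance: <xi^2> - <xi>^2.
    delta = np.sqrt((X ** 2).mean(axis=0) - mean ** 2)
    return (X - mean) / delta, mean, delta

def to_original_units(w, mean, delta):
    """Equations (11)-(12): map weights learnt on standardized inputs back to the
    user's units and rescale them to norm sqrt(N+1). w[0] is the bias."""
    v = np.empty_like(w)
    v[0] = w[0] - np.sum(w[1:] * mean / delta)
    v[1:] = w[1:] / delta
    return np.sqrt(len(w)) * v / np.linalg.norm(v)

# Illustrative data and weights (arbitrary values, floats on purpose).
X_raw = np.array([[1.0, 10.0], [2.0, 30.0], [4.0, 20.0]])
w_std = np.array([0.5, -1.0, 2.0])

X_std, mean, delta = standardize(X_raw)
w_orig = to_original_units(w_std, mean, delta)
field_std = w_std[0] + X_std @ w_std[1:]      # weighted sums on standardized inputs
field_orig = w_orig[0] + X_raw @ w_orig[1:]   # weighted sums on the original inputs
```

The two weighted sums differ only by a positive factor, so that both versions of the perceptron assign the same class to every pattern.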

2.6 Interpretation 

It has been shown [22] that the contribution of each pattern to the cost function
of Minimerror, $[1 - \tanh(\gamma^\mu/2T)]/2$, may be interpreted as the probability
of misclassification at the temperature $T$ at which the minimum of the
cost function has been determined. By analogy, the neuron's prediction on a
new input $\xi$ may be given a confidence measure by replacing the (unknown)
pattern stability by its absolute value $|\gamma| = |w \cdot \xi| / \|w\|$, which is
its distance to the hyperplane. This interpretation of the sigmoidal function
$\tanh(|\gamma|/2T)$ as the confidence on the neuron's output is similar to the
one proposed earlier [18] within an approach based on information theory.
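
For a single neuron, these two quantities are straightforward to evaluate. The following lines are a minimal sketch with hypothetical function names; the temperature `T` is assumed to be the one at which the weights were determined.

```python
import numpy as np

def misclassification_probability(gamma, T):
    """Contribution of a pattern of stability gamma to the cost: [1 - tanh(gamma/2T)]/2."""
    return 0.5 * (1.0 - np.tanh(gamma / (2.0 * T)))

def confidence(w, xi, T):
    """tanh(|gamma|/2T), with |gamma| = |w . xi| / ||w|| the distance of the
    new input xi to the separating hyperplane."""
    distance = abs(w @ xi) / np.linalg.norm(w)
    return np.tanh(distance / (2.0 * T))
```

An input lying on the hyperplane gets confidence 0, and the confidence approaches 1 for inputs lying much farther than $2T$ from it.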

The generalization of these ideas to multilayered networks is not straight- 
forward. An estimate of the confidence on the classification by the output 
neuron should include the magnitude of the weighted sums of the hidden 
neurons, as they measure the distances of the input pattern to the domain 
boundaries. However, short distances to the separating hyperplanes are not 
always correlated to low confidence on the network's output. For an example,
we refer again to figure 1. Consider a pattern lying close to hyperplane 1. A
small weighted sum on neuron 1 may cast doubt on the classification if the
pattern's IR is (-++), but not if it is (-+-), as a change of the sign of the
weighted sum in the latter case will map the pattern to the IR (++-) which,
being another IR of the same class, will be given the same output by the network.
It is worth noting that the same difficulty is met by the interpretation
of the outputs of multilayered perceptrons, trained with Backpropagation, as
posterior probabilities. We do not go further into this problem, which is
beyond the scope of this paper.

3 Comparison with other strategies 

There are few learning algorithms for neural networks composed of binary 
units. To our knowledge, all of them are incremental. In this section we 
give a short overview of some of them, in order to put forward the main 
differences with NetLines. We discuss the growth heuristics first, and the 
individual units training algorithms afterwards. 

The Tiling algorithm [29] introduces hidden layers, one after the other.
The first neuron of each layer is trained to learn an IR that helps to decrease
the number of training errors; supplementary hidden units are then appended
to the layer until the IRs of all the patterns in the training set are faithful.
This procedure may generate very large networks. The Upstart algorithm
[11] introduces successive couples of daughter hidden units between the input
layer and the previously included hidden units, which become their parents.
The daughters are trained to correct the parents' classification errors, one
daughter for each class. The obtained network has a tree-like architecture.
There are two different algorithms implementing the Tilinglike Learning in
the Parity Machine: Offset [28] and MonoPlane ^8\. In both algorithms,
each appended unit is trained to correct the errors of the previously included 
unit in the same hidden layer, a procedure that has been shown to generate 
a parity machine: the class of the input patterns is the parity of the learnt 
IRs. Unlike Offset, which implements the parity through a second hidden 
layer, that needs to be pruned, MonoPlane goes on adding hidden units (if 
necessary) in the same hidden layer until the number of training errors at 
the output vanishes. Convergence proofs for binary input patterns have been 
produced for all these algorithms. In the case of real-valued input patterns, a 
solution to the parity machine with a bounded number of hidden units also 
exists [19].

The rationale behind the construction of the parity machine is that it is 
not worth training the output unit before all the training errors of the hidden 
units have been corrected. However, Marchand et al. [27] pointed out that 
it is not necessary to correct all the errors of the successively trained hidden 
units: it is sufficient that the IRs be faithful and LS. If the output unit 
is trained immediately after each appended hidden unit, the network may
discover that the IRs are already faithful and stop adding units. This may be
seen on the example of figure 1. None of the parity machine implementations
would find the solution represented on the figure, as each of the 3 perceptrons
unlearns systematically part of the patterns learnt by the preceding one.

To our knowledge, Sequential Learning [27] is the only incremental learning
algorithm that might find a solution equivalent (although not the same)
to the one of figure 1. In this algorithm, the first unit is trained to separate
the training set keeping one "pure" half-space, i.e. a half-space only containing
patterns of one class. Wrongly classified patterns, if any, must all lie in
the other half-space. Each appended neuron is trained to separate wrongly
classified patterns with this constraint, i.e. keeping always one "pure", error-free,
half-space. Thus, neurons must be appended in a precise order, making
the algorithm difficult to implement in practice. For example, Sequential
Learning applied to the problem of figure 1 needs to impose that the first
unit finds the weights $w_3$, as this is the only solution satisfying the purity
restriction. 

Other proposed incremental learning algorithms strive to solve the prob- 
lem with different architectures, and/or with real valued units. For example, 
in the algorithm Cascade Correlation [9], each appended unit is selected 
among a pool of several real-valued neurons, trained to learn the correlation 
between the targets and the training errors. The unit is then connected to 
the input units and to all the other hidden neurons already included in the 
network. 

Another approach to learning classification tasks is through the construction
of decision trees [5], which partition hierarchically the input space
through successive dichotomies. The neural network implementations generate
tree-like architectures. Each neuron of the tree introduces a dichotomy
of the input space which is treated separately by the children nodes, which
eventually produce new splits. Besides the weights, the resulting networks
need to store the decision path. The proposed heuristics [36] [IHl [26] differ in
the algorithm used to train each node, and/or in the stopping criterion. In
particular, Neural-Trees [36] may be regarded as a generalization of CART
[5] in which the hyperplanes are not constrained to be perpendicular to the
coordinate axes. The heuristics of the Modified Neural Tree Network [10],
similar to Neural-Trees, includes a criterion of early stopping based on a
confidence measure of the partition. As NetLines considers the whole input
space to train each hidden unit, it generates domain boundaries which may
greatly differ from the splits produced by trees. We are not aware of any
systematic study or theoretical comparison of both approaches.

Other algorithms, like RCE [34], GAL p], Glocal [t| and Growing cells
[14] propose to cover or mask the input space with hyperspheres of adaptive
size containing patterns of the same class. These approaches generally end 
up with a very large number of units. Covering Regions by the LP Method 
[30] is a trial and error procedure devised to select the most efficient masks 
among hyperplanes, hyperspheres or hyperellipsoids. The mask's parameters 
are determined through linear programming. 

Many incremental strategies use the Pocket algorithm [15] to train the 
appended units. Its main drawback is that it has no natural stopping condi- 
tion, which is left to the user's patience. The proposed alternative algorithms 
[12, 3] are not guaranteed to find the best solution to the problem of learning.
The algorithm used by the Modified Neural Tree Network (MNTN)
[10] and the ITRULE [18] minimize cost functions similar to (4), but using
different misclassification measures in place of our stability (3). The essential
difference with Minimerror is that none of these algorithms is able
to control which patterns contribute to learning, like Minimerror does with
the temperature.

4 Generalization to Multi-class Problems 

The usual way to cope with problems having more than two classes is to
generate as many networks as classes. Each network is trained to separate
patterns of one class from all the others, and a winner-takes-all (WTA) strategy
based on the value of the output's weighted sum in equation (2) is used to
decide the class if more than one network recognizes the input pattern. As
we use normalized weights, in our case the output's weighted sum is merely
the distance of the IR to the separating hyperplane. All the patterns mapped
to the same IR are given the same output's weighted sum, independently of
the relative position of the pattern in input space. As already discussed in
section 2.6, a strong weighted sum on the output neuron is not inconsistent
with small weighted sums on the hidden neurons. Therefore, a naive WTA 
decision may not give good results, as is shown in the example of section 

We now describe an implementation for the multi-class problem that results
in a tree-like architecture of networks. It is more involved than the naive
WTA, and may be applied to any binary classifier. Suppose that we have
a problem with $C$ classes. We must choose in which order the classes will
be learnt, say $(c_1, c_2, \dots, c_C)$. This order constitutes a particular learning
sequence. Given a particular learning sequence, a first network is trained
to separate class $c_1$, which is given output target $\tau_1 = +1$, from the others
(which are given targets $\tau_1 = -1$). The opposite convention is equivalent,
and could equally be used. After training, all the patterns of class $c_1$ are
eliminated from the training set and we generate a second network trained
to separate patterns of class $c_2$ from the remaining classes. The procedure,
reiterated with training sets of decreasing size, generates $C - 1$ hierarchically
organized networks, the tree of networks (TON): the outputs are ordered sequences
$\zeta = (\zeta_1, \zeta_2, \dots, \zeta_{C-1})$. The predicted class of a pattern is $c_i$, where $i$ is the
first network in the sequence having an output $+1$ ($\zeta_i = +1$ and $\zeta_j = -1$ for
$j < i$), the outputs of the networks with $j > i$ being irrelevant.
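
The construction and use of a TON may be sketched independently of the binary classifier. In the sketch below, `train_binary` is a hypothetical callable that receives the patterns and their ±1 targets and returns a function mapping a pattern to +1 or -1; assigning the last class $c_C$ to a pattern rejected by all the $C - 1$ networks is an assumption of this sketch.

```python
def ton_train(train_binary, X, y, sequence):
    """Train C-1 binary classifiers following the learning sequence (c_1, ..., c_C)."""
    networks = []
    X, y = list(X), list(y)
    for c in sequence[:-1]:
        # Class c gets target +1, all the remaining classes get target -1.
        targets = [1 if label == c else -1 for label in y]
        networks.append(train_binary(X, targets))
        # Eliminate the patterns of class c before training the next network.
        keep = [i for i, label in enumerate(y) if label != c]
        X, y = [X[i] for i in keep], [y[i] for i in keep]
    return networks

def ton_predict(networks, sequence, x):
    """Return the class of the first network in the sequence whose output is +1."""
    for c, network in zip(sequence, networks):
        if network(x) == 1:
            return c
    return sequence[-1]   # assumption: no network fired, the last class remains
```

Any binary classifier, NetLines included, can be plugged in through `train_binary`; the sketch only fixes the bookkeeping of the learning sequence.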

The performance of the TON may depend on the chosen learning sequence.
Therefore, it is convenient that an odd number of TONs, trained
with different learning sequences, compete through a vote. We verified empirically,
as is shown in section 5.3, that this vote improves the results obtained
with each of the individual TONs participating in the vote. Notice that
our procedure is different from bagging [4], as all the networks of the TON
are trained with the same training set, without the need of any resampling
procedure.
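
The vote itself may be sketched as a plain majority count over the classes predicted by the individual TONs. The function name is hypothetical, and the tie-breaking rule (the class predicted first among the tied ones wins) is an assumption, since with more than two classes an odd number of voters does not exclude ties.

```python
from collections import Counter

def vote(predictions):
    """Majority vote over the classes predicted by an odd number of TONs.
    Ties are broken in favour of the class that appears first (assumption)."""
    return Counter(predictions).most_common(1)[0][0]
```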

5 Applications 

Although convergence proofs of learning algorithms are satisfactory on the- 
oretical grounds, they are not a guarantee of good generalization. In fact, 
they only demonstrate that correct learning is possible, but do not address 
the problem of generalization. This last issue still remains quite empirical 
[40] [TTI [13], and the generalization performance of learning algorithms is
usually tested on well known benchmarks [32].

We first tested the algorithm on learning the parity function of $N$ bits for
$2 \le N \le 11$. It is well known that the smallest network with the architecture
considered here needs $H = N$ hidden neurons. The optimal architecture was
found in all the cases. Although this is quite an unusual performance, the
parity is not a representative problem: learning is exhaustive and generalization
cannot be tested. Another test, the classification of sonar signals [23],
revealed the quality of Minimerror, as it solved the problem without hidden
units. In fact, we found that not only the training set of this benchmark
is linearly separable, a result already reported [25, 35], but that the complete
data base, i.e. the training and the test sets together, is also linearly
separable.

We present next our results, generalization error $\epsilon_g$ and number of weights,
on several benchmarks corresponding to different kinds of problems: binary
classification of binary input patterns, binary classification of real-valued
input patterns, and multi-class problems. These benchmarks were chosen
because they have already served as a test for many other algorithms, providing
unbiased results to compare with. The generalization error $\epsilon_g$ of NetLines
was estimated as usual, through the fraction of misclassified patterns on a
test set of data.

The results are reported as a function of the training set size P whenever 
this size is not specified by the benchmark. Besides the generalization 
error ε_g, averaged over a (specified) number of classifiers trained with 
randomly selected training sets, we also present the number of weights of the 
corresponding networks. The latter is a measure of the classifier's complexity, 
as it corresponds to the number of its parameters. 
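
The evaluation protocol described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `train` callback interface and the stand-in majority classifier are hypothetical and are not part of NetLines.

```python
import random


def generalization_error(predict, test_patterns, test_targets):
    """Fraction of misclassified patterns on a test set."""
    errors = sum(1 for x, t in zip(test_patterns, test_targets) if predict(x) != t)
    return errors / len(test_patterns)


def averaged_error(train, patterns, targets, P, n_trials, seed=0):
    """Average the test error over n_trials random training sets of size P.

    `train` is a hypothetical callback: it receives training patterns and
    targets and returns a prediction function.
    """
    rng = random.Random(seed)
    D = len(patterns)
    results = []
    for _ in range(n_trials):
        idx = list(range(D))
        rng.shuffle(idx)
        train_idx, test_idx = idx[:P], idx[P:]  # G = D - P test patterns
        predict = train([patterns[i] for i in train_idx],
                        [targets[i] for i in train_idx])
        results.append(generalization_error(predict,
                                            [patterns[i] for i in test_idx],
                                            [targets[i] for i in test_idx]))
    return sum(results) / n_trials


def train_majority(train_patterns, train_targets):
    """Stand-in classifier (assumption): always predicts the majority class."""
    positives = sum(1 for t in train_targets if t == 1)
    label = 1 if 2 * positives >= len(train_targets) else -1
    return lambda x: label
```

Any learning algorithm that returns a prediction function could be passed as `train`; the majority classifier is there only to make the sketch runnable.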

Training times are usually cited among the characteristics of training 
algorithms. Only the numbers of epochs used by Backpropagation on two 
of the studied benchmarks have been published; we restrict the comparison 
to these cases. As NetLines only updates N weights per epoch, whereas 
Backpropagation updates all the network's weights, we compare the total 
number of weight updates. They are of the same order of magnitude for 
both algorithms. However, these comparisons should be taken with caution. 
NetLines is a deterministic algorithm: it learns the architecture and 
the weights in a single run, whereas with Backpropagation several 
architectures must first be investigated, and this time is not included 
in the training time. 

The following notation is used: D is the total number of available pat- 
terns, P the number of training patterns, G the number of test patterns. 

5.1 Binary inputs 

The case of binary input patterns has the property, not shared by real-valued 
inputs, that every pattern may be separated from the others by a single 
hyperplane. This solution, usually called grandmother, needs as many hidden 
units as there are patterns in the training set. In fact, the convergence 
proofs for incremental algorithms in the case of binary input patterns are 
based on this property. 
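
A minimal sketch of this property, assuming inputs coded as ±1 (the coding is an assumption made here for illustration): a unit whose weights equal the pattern itself and whose threshold is N - 1 responds positively to that pattern only, because any other binary pattern differs in at least one bit and its overlap is at most N - 2. The name `grandmother_unit` is hypothetical.

```python
from itertools import product


def grandmother_unit(pattern):
    """Binary unit that fires (+1) only on `pattern`, assuming +/-1 coding."""
    n = len(pattern)
    weights = list(pattern)   # overlap with the pattern itself equals n
    threshold = n - 1         # any other pattern has overlap <= n - 2

    def unit(x):
        field = sum(w * xi for w, xi in zip(weights, x)) - threshold
        return 1 if field > 0 else -1

    return unit


N = 4
target = (1, -1, -1, 1)
unit = grandmother_unit(target)
# Enumerate all 2^N binary patterns and keep those on which the unit fires.
firing = [x for x in product((-1, 1), repeat=N) if unit(x) == 1]
```

With one such unit per training pattern, the training set is learned exactly, which is why the grandmother solution needs as many hidden units as patterns.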

5.1.1 Monk's problem 

This benchmark, thoroughly studied with many different learning algorithms 
[39], contains three distinct problems. Each one has an underlying logical 
proposition that depends on six discrete variables, coded with N = 17 
binary numbers. The total number of possible input patterns is D = 432, and 
the targets correspond to the truth table of the corresponding proposition. 
Both NetLines and MonoPlane found the underlying logical proposition of 
the first two problems, i.e. they generalized correctly, giving ε_g = 0. In 
fact, these are easy problems: all the neural network-based algorithms, and 
some non-neural learning algorithms, were reported to generalize them 
correctly. In the third Monk's problem, 6 patterns among the P_3 = 122 
examples are given wrong targets. The generalization error is calculated over 
the complete set of D = 432 patterns, i.e. including the training patterns, 
but in the test set all the patterns are given the correct targets. Thus, any 
training method that learns the training set correctly will make at least 
1.4% of generalization errors. Four algorithms specially adapted to noisy 
problems were reported to reach ε_g = 0. However, none of them generalizes 
correctly the two other (noiseless) Monk's problems. Besides them, the best 
performance, ε_g = 0.0277, which corresponds to 12 misclassified patterns, is 
only reached by neural network methods: Backpropagation, Backpropagation with 
Weight Decay, Cascade-Correlation and NetLines. The size of the network 
generated by NetLines (58 weights) is intermediate between Backpropagation 
with Weight Decay (39), and Cascade-Correlation (75) or Backpropagation 
(77). MonoPlane reached a slightly worse performance (ε_g = 0.0416, i.e. 18 
misclassified patterns) with the same number of weights as NetLines, showing 
that the parity machine encoding may not be optimal. 
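
The error rates quoted for the third Monk's problem follow directly from the pattern counts. The short check below only redoes that arithmetic; the helper name `error_rate` is introduced here for illustration.

```python
D = 432  # total number of patterns in the third Monk's problem


def error_rate(misclassified, total=D):
    """Generalization error as a fraction of misclassified patterns."""
    return misclassified / total


noise_floor = error_rate(6)   # 6 mislabelled training patterns: about 1.4%
best_neural = error_rate(12)  # 12 misclassified patterns: 0.02777...
monoplane = error_rate(18)    # 18 misclassified patterns: 0.04166...

for name, value in (("noise floor", noise_floor),
                    ("best neural methods", best_neural),
                    ("MonoPlane", monoplane)):
    print(f"{name}: {value:.5f}")
```

The values 0.0277 and 0.0416 given in the text are these fractions truncated to four decimals.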

5.1.2 Two or more clumps 

In this problem [6] the network has to discriminate whether or not the number 
of clumps in a ring of N bits is strictly smaller than 2. One clump is a 
sequence of identical bits bounded by bits of the other kind. The patterns are 
generated through a Monte Carlo method in which the mean number of clumps is 
controlled by a parameter k [29]. We generated training sets of P patterns 
with k = 3, corresponding to a mean number of clumps of ~ 1.5, for rings of 
N = 10 and N = 25 bits. The generalization errors of several learning 
algorithms, estimated with independently generated test sets of the same 
sizes as the training sets, i.e. G = P, are displayed in figure 2 as a 
function of P. Points with error bars correspond to averages over 25 
independent training sets. Points without error bars correspond to best 
results. NetLines, MonoPlane and Upstart for N = 25 have nearly the same 
performance when trained to reach error-free learning. 

Figure 2: Two or more clumps for two ring sizes, N = 10 and N = 25. 
Generalization error vs. size of the training set P, for different algorithms. 
N = 10: Backpropagation [23], Stepwise [26]. N = 25: Tiling [29], Upstart 
[11]. Results with the Growth Algorithm [?] are indistinguishable from those 
of Tiling at the scale of the figure. Points without error bars correspond to 
best results. Results of MonoPlane and NetLines are averages over 25 tests. 

Figure 3: Two or more clumps. Number of weights (logarithmic scale) vs. 
size of the training set P, for N = 10 and N = 25. Results of MonoPlane 
and NetLines are averages over 25 tests. The references are the same as in 
figure 2. 
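
The classification target of the clumps problem can be sketched as below. This is an illustration under an explicit assumption: following the usual formulation of the benchmark, only runs of 1-bits are counted as clumps, and the ring is closed (the last bit is adjacent to the first). The function names are hypothetical.

```python
def count_clumps(ring):
    """Number of runs of 1-bits on a ring (assumption: clumps are runs of 1s)."""
    n = len(ring)
    if n > 0 and all(b == 1 for b in ring):
        return 1  # a ring of ones is a single clump with no boundary
    # Each clump starts at a 0 -> 1 transition; ring[i - 1] wraps around at i = 0.
    return sum(1 for i in range(n) if ring[i - 1] == 0 and ring[i] == 1)


def two_or_more_clumps(ring):
    """Target: +1 if the ring has two or more clumps, -1 otherwise."""
    return 1 if count_clumps(ring) >= 2 else -1
```

For example, the ring `[1, 0, 0, 1]` contains a single clump, because its first and last bits are neighbours on the ring.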

We tested the effect of early stopping by imposing on NetLines a maximal 
number of two hidden units (H = 2). The residual training error is plotted 
on the same figure, as a function of P. It may be seen that early stopping 
does not help to decrease ε_g. Overfitting, which arises when NetLines is 
applied until error-free training is reached, does not degrade the network's 
generalization performance. This behavior is very different from that of 
networks trained with Backpropagation. The latter reduces classification 
learning to a regression problem, in which the generalization error can be 
decomposed into two competing terms: bias and variance. With Backpropagation, 
early stopping helps to decrease overfitting because some hidden neurons do 
not reach large enough weights to work in the non-linear part of the sigmoidal 
transfer functions. It is well known that all the neurons working in the 
linear part may be replaced by a single linear unit. Thus, with early stopping, 
the network is equivalent to a smaller one, i.e. one having fewer parameters, 
with all the units working in the non-linear regime. Our results are 
consistent with recent theories [13] showing that, contrary to regression, the 
bias and variance components of the generalization error in classification 
combine in a highly non-linear way. 
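
The statement that units working in the linear regime may be replaced by a single linear unit can be checked with a small example. The sketch assumes purely linear hidden units with unit slope and no biases; the function names and the numerical values are chosen here for illustration only.

```python
def linear_network(x, hidden_weights, output_weights):
    """Output of a network whose hidden units all work in the linear regime."""
    return sum(W * sum(w_j * x_j for w_j, x_j in zip(w, x))
               for W, w in zip(output_weights, hidden_weights))


def collapse(hidden_weights, output_weights):
    """Weights of the single linear unit equivalent to the whole network."""
    n = len(hidden_weights[0])
    return [sum(W * w[j] for W, w in zip(output_weights, hidden_weights))
            for j in range(n)]


hidden = [[1, 2], [3, -1], [0, 4]]  # three linear hidden units, two inputs
output = [2, -1, 1]                 # hidden-to-output weights
single = collapse(hidden, output)   # equivalent single unit

x = [5, 7]
full_output = linear_network(x, hidden, output)
single_output = sum(s * xi for s, xi in zip(single, x))
```

Here six hidden weights and three output weights reduce to two effective parameters, which is the sense in which early stopping yields a smaller equivalent network.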

The number of weights used by the different algorithms is plotted on 
a logarithmic scale as a function of P in figure 3. It turns out that the 
strategy of NetLines is slightly better than that of MonoPlane, with respect 
to both generalization performance and network size. 

5.2 Real-valued inputs 

We tested NetLines on two problems which have real-valued inputs (we 
include graded-valued inputs here). 

5.2.1 Wisconsin Breast Cancer Data Base 

The input patterns of this benchmark [42] have N = 9 attributes 
characterizing samples of breast cytology, classified as benign or malignant. 
We excluded from the original data base 16 patterns that have the attribute 
"bare nuclei" missing. Among the remaining D = 683 patterns, the two 
classes are unevenly represented, 65.5% of the examples being benign. We 
studied the generalization performance of networks trained with sets of 
several sizes P. The P patterns for each learning test were selected at random. 
In figure 4(a), the generalization error at classifying the remaining 
G = D - P patterns is displayed as a function of the corresponding number of 
weights on a logarithmic scale. For comparison, we included in the same figure 
results of a single perceptron trained with P = 75 patterns using Minimerror. 
The results, averaged over 50 independent tests for each P, show that both 
NetLines and MonoPlane have lower ε_g and fewer parameters than 
other algorithms on this benchmark. 

The total number of weights updates needed by NetLines, including the 
weights of the dropped output units, is 7- 10''; Backpropagation needed ~ 10^ 

[Figure 4. Panel "Breast cancer (a)": generalization error vs. number of 
weights (logarithmic axis), with points for Minimerror (P=75), NetLines 
(P=160) and MonoPlane (P=160). Panel "Breast cancer (b)": MonoPlane and 
NetLines, vs. possible values of the attribute.] 

Figure 4: Breast cancer classification. (a) Generalization error ε_g vs. 
number of weights (logarithmic scale), for P = 525. 1, 2, 3: Rprop with no 
shortcuts [32]; 4, 5, 6: Rprop with shortcuts [32]; 7: Cascade Correlation 
[7]. For comparison, results with smaller training sets, P = 75 (single 
perceptron) and P = 160 patterns, are displayed. Results of MonoPlane and 
NetLines are averages over 50 tests. (b) Classification errors vs. possible 
values of the missing attribute "bare nuclei" for the 16 incomplete patterns, 
averaged over 50 independently trained networks. 
Figure 5: Diabetes diagnosis. Generalization error ε_g vs. number of weights. 
Results of NetLines are averages over 50 tests. 1, 2, 3: Rprop with no 
shortcuts; 4, 5, 6: Rprop with shortcuts [32]. 

The trained network may be used to classify the patterns with missing 
attributes. The number of misclassified patterns among the 16 cases for which 
the attribute "bare nuclei" is missing is plotted as a function of the 
possible values of this attribute in figure 4(b). For large values of the 
attribute, the medical diagnosis and the network's diagnosis disagree in half 
of the cases. This is an example of the kind of information that may be 
obtained in practical applications. 

5.2.2 Diabetes diagnosis 

This benchmark [32] contains D = 768 patterns described by N = 8 
real-valued attributes, corresponding to ~ 35% of Pima women suffering from 
diabetes, 65% being healthy. Training sets of P = 576 patterns were 
selected at random, and generalization was tested on the remaining G = 192 
patterns. The comparison with published results obtained with other 
algorithms tested under the same conditions, presented in figure 5, shows that 
NetLines reaches the best performances published so far on this benchmark, 
needing far fewer parameters. Training times of NetLines are of ~ 10^ 
updates. The numbers of updates needed by Rprop [32] range between 4 · 10^ 
and 5 · 10^, depending on the network's architecture. 
5.3 Multi-class problems 

We applied our learning algorithm to two different problems, both with three 
classes. We compare the results obtained with a winner-takes-all (WTA) 
classification based on the results of three networks, each one independently 
trained to separate one class from the two others, to the results of the TON 
architectures described in section 4. As the number of classes is low, we 
determined the three TONs corresponding to the three possible learning 
sequences. The vote of the three TONs improves the performance, as 
expected. 
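
The two decision rules compared here can be sketched generically. The sketch rests on assumptions that this section does not fix: the WTA rule is taken to select the class whose one-against-the-rest network gives the largest output field, and ties in the vote are broken in favour of the lowest class index. Function names are hypothetical.

```python
from collections import Counter


def winner_takes_all(fields):
    """Index of the class whose one-vs-rest network has the largest field.

    `fields[c]` is assumed to be the output field of the network trained to
    separate class c from the others.
    """
    return max(range(len(fields)), key=lambda c: fields[c])


def majority_vote(predictions):
    """Class predicted by most classifiers (e.g. the three TONs).

    Ties are broken by taking the lowest class index, an arbitrary choice
    made for this sketch.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    return min(c for c, n in counts.items() if n == best)
```

For instance, `majority_vote([2, 0, 2])` returns class 2, the class chosen by two of the three classifiers.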

5.3.1 Breiman's Waveform Recognition Problem 

This problem was introduced as a test for the algorithm CART [5]. The input 
patterns are defined by N = 21 real-valued amplitudes x(t) observed at 
regularly spaced intervals t = 1, 2, ..., N. Each pattern is a noisy convex 
linear combination of two among three elementary waves (triangular waves 
centered on three different values of t). There are three possible 
combinations, and the pattern's class identifies the combination from which 
it was generated. 
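
A pattern generator for this benchmark can be sketched as follows. The numerical details are assumptions taken from the usual description of the benchmark and are not stated in this section: triangular waves of height 6 centered at t = 11, 15 and 7, a mixing coefficient drawn uniformly in [0, 1], and independent Gaussian noise of unit variance on each amplitude. All names are hypothetical.

```python
import random

N = 21  # number of amplitudes per pattern

# Assumed wave centers: h1 at t = 11, h2 at t = 15, h3 at t = 7.
CENTERS = (11, 15, 7)
# Assumed class definitions: which two waves are combined for each class.
PAIRS = {0: (0, 1), 1: (0, 2), 2: (1, 2)}


def base_wave(t, center):
    """Triangular wave of height 6 centered at `center` (assumed shape)."""
    return max(6 - abs(t - center), 0)


def make_waveform(cls, rng):
    """One noisy convex combination of the two waves defining class `cls`."""
    a, b = PAIRS[cls]
    u = rng.random()  # convex mixing coefficient in [0, 1)
    return [u * base_wave(t, CENTERS[a])
            + (1 - u) * base_wave(t, CENTERS[b])
            + rng.gauss(0, 1)  # unit-variance Gaussian noise (assumption)
            for t in range(1, N + 1)]


rng = random.Random(0)
training_set = [(make_waveform(c, rng), c) for c in range(3) for _ in range(100)]
```

The last line builds a set of 300 labelled patterns with 100 examples per class; the class balance is a simplification chosen for the sketch.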

We trained the networks with the same 11 training sets of P = 300 
examples, and generalization was tested on the same independent test set 
of G = 5000 patterns, as in [16]. Our results are displayed in figure 6, 
where only results of algorithms reaching ε_g < 0.25 in [16] are included. 
Although it is known that, due to the noise, the classification error has a 
lower bound of ~ 14% [5], the results of NetLines and MonoPlane presented 
here correspond to error-free training. The networks generated by NetLines 
have between 3 and 6 hidden neurons, depending on the training set. The 
results obtained with a single perceptron trained with Minimerror and with 
the Perceptron learning algorithm, which may be considered as the extreme case 
of early stopping, are hardly improved by the more complex networks. Here 
again the overfitting produced by error-free learning with NetLines does not 
deteriorate the generalization performance. The TONs vote reduces the 
variance, but does not decrease the average ε_g. 

Figure 6: Breiman waveforms: Generalization error ε_g averaged over 11 tests 
vs. number of parameters. Error bars on the number of weights generated 
by NetLines and MonoPlane are not visible at the scale of the figure. 1: 
Linear disc., 2: Perceptron, 3: Backpropagation, 4: Genetic algorithm, 5: 
Quadratic disc., 6: Parzen's kernel, 7: K-NN, 8: Constraint [16]. 

5.3.2 Fisher's Iris plants database 

In this classical three-class problem, one has to determine the class of Iris 
plants, based on the values of N = 4 real-valued attributes. The database 
of D = 150 patterns contains 50 examples of each class. Networks were 
trained with P = 149 patterns, and the generalization error is the mean 
value over all the 150 possible leave-one-out tests. Results of ε_g are 
displayed as a function of the number of weights in figure 7. Error bars are 
available only for our own results. In this hard problem, the vote of the 
three possible TONs trained with the three possible class sequences (see 
section 4) improves the generalization performance. 
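
The leave-one-out estimate used for this database can be sketched as below. The `train` callback interface and the nearest-mean stand-in classifier are assumptions introduced to keep the example runnable; they do not represent NetLines or the TON architectures.

```python
def leave_one_out_error(train, patterns, targets):
    """Mean error over all tests in which one pattern is held out.

    `train` is a hypothetical callback returning a prediction function.
    """
    errors = 0
    for i in range(len(patterns)):
        predict = train(patterns[:i] + patterns[i + 1:],
                        targets[:i] + targets[i + 1:])
        if predict(patterns[i]) != targets[i]:
            errors += 1
    return errors / len(patterns)


def train_nearest_mean(train_patterns, train_targets):
    """Stand-in classifier (assumption): assign the class of the nearest mean."""
    sums, counts = {}, {}
    for x, t in zip(train_patterns, train_targets):
        acc = sums.setdefault(t, [0.0] * len(x))
        for j, xj in enumerate(x):
            acc[j] += xj
        counts[t] = counts.get(t, 0) + 1
    means = {t: [s / counts[t] for s in sums[t]] for t in sums}

    def predict(x):
        return min(means, key=lambda t: sum((xj - mj) ** 2
                                            for xj, mj in zip(x, means[t])))

    return predict
```

With D = 150 patterns this procedure trains 150 classifiers, each on P = 149 patterns, and tests each one on the single pattern left out.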

Figure 7: Iris database: Generalization error vs. number of parameters. 1: 
Offset, 2: Backpropagation [?]; 4, 5: Backpropagation [?]; 3, 6: GOT [?]. 

6 Conclusion 

We presented an incremental learning algorithm for classification, which we 
call NetLines, for Neural Encoder Through Linear Separation. It generates 
small feedforward neural networks with a single hidden layer of binary units 
connected to a binary output neuron. NetLines allows for an automatic 
adaptation of the neural network to the complexity of the particular task. 
This is achieved by coupling an error-correcting strategy for the successive 
addition of hidden neurons with Minimerror, a very efficient perceptron 
training algorithm. Learning is fast, not only because it reduces the problem 
to that of training single perceptrons, but mainly because there is no longer 
any need for the usual preliminary tests necessary to determine the correct 
architecture for the particular application. Theorems valid for binary as well 
as for real-valued inputs guarantee the existence of a solution with a bounded 
number of hidden neurons obeying the growth strategy. 

The networks are composed of binary hidden units whose states constitute 
a faithful encoding of the input patterns. They implement a mapping 
from the input space to a discrete H-dimensional hidden space, H being 
the number of hidden neurons. Thus, each pattern is labelled with a binary 
word of H bits. This encoding may be seen as a compression of the pattern's 
information. The hidden neurons define linear boundaries, or portions of 
boundaries, between classes in input space. The network's output may be 
given a probabilistic interpretation based on the distance of the patterns to 
these boundaries. 

Tests on several benchmarks showed that the networks generated by our 
incremental strategy are small, in spite of the fact that hidden neurons 
are appended until error-free learning is reached. Even in those cases where 
the networks obtained with NetLines are larger than those used by other 
algorithms, its generalization error remains among the smallest values 
reported. In noisy or difficult problems it may be useful to stop the 
network's growth before the condition of zero training errors is reached. 
This decreases overfitting, as smaller networks (with fewer parameters) are 
thus generated. However, the prediction quality (measured by the 
generalization error) of the classifiers generated with NetLines is not 
improved by early stopping. 

The results presented in this paper were obtained without cross-validation 
or any data manipulation like boosting, bagging or arcing [4] [8]. Those 
costly procedures combine results of very large numbers of classifiers, with 
the aim of improving the generalization performance through the reduction of 
the variance. As NetLines is a stable classifier, presenting small variance, we 
do not expect that such techniques would significantly improve our results. 
In this Appendix we exhibit a particular solution to the learning strategy of 
NetLines. This solution is built in such a way that the cardinal of a convex 
subset of well learnt patterns, L_h, grows monotonically upon the addition 
of hidden units. As this cardinal cannot be larger than the total number of 
training patterns, the algorithm must stop with a finite number of hidden 
units. 

Suppose that h hidden units have already been included and that the 
output neuron still makes classification errors on patterns of the training 
set, called training errors. Among these wrongly learned patterns, let ν be 
the one closest to the hyperplane normal to w_h, called hyperplane-h hereafter. 
We define L_h as the subset of (correctly learnt) patterns lying closer to 
hyperplane-h than ξ^ν. Patterns in L_h have 0 < γ_h < |γ_h^ν|. The subset L_h 
and at least pattern ν are well learnt if the next hidden unit, h+1, has 
weights: 

$$ w_{h+1} = \tau_h^\nu w_h - (1 - \epsilon_h)\, \tau_h^\nu\, (w_h \cdot \xi^\nu)\, e_0 \qquad (13) $$

where e_0 = (1, 0, ..., 0). The conditions that both L_h and pattern ν have 
positive stabilities (i.e. be correctly learned) impose that 

$$ 0 < \epsilon_h < \min_{\mu \in L_h} \frac{|\gamma_h^\nu| - \gamma_h^\mu}{|\gamma_h^\nu|} \qquad (14) $$

The following weights between the hidden units and the output will give 
the correct output to pattern ν and to the patterns of L_h: 

$$ W_0(h+1) = W_0(h) + \tau^\nu \qquad (15) $$

$$ W_i(h+1) = W_i(h) \quad \text{for } 1 \le i \le h \qquad (16) $$

$$ W_{h+1}(h+1) = -\tau^\nu \qquad (17) $$

Thus, card(L_{h+1}) ≥ card(L_h) + 1. As the number of patterns in L_h 
increases monotonically with h, convergence is guaranteed with fewer than P 
hidden units. 